Biostat 212a Homework 4

Due Mar. 5, 2024 @ 11:59PM

Author

Yang An and UID: 106332601

Published

February 29, 2024

1 ISL Exercise 8.4.3 (10pts)

  1. Consider the Gini index, classification error, and entropy in a simple classification setting with two classes. Create a single plot that displays each of these quantities as a function of p̂m1. The x-axis should display p̂m1, ranging from 0 to 1, and the y-axis should display the value of the Gini index, classification error, and entropy. Hint: In a setting with two classes, p̂m1 = 1 − p̂m2. You could make this plot by hand, but it will be much easier to make in R.
# grid of probabilities for class 1; p2 = 1 - p1 in the two-class setting
p1 <- seq(0, 1, 0.01)
p2 <- 1 - p1
# Gini index: 2 * p1 * p2
gini <- 2 * p1 * p2
# classification error: 1 - max(p1, p2)
class.error <- 1 - pmax(p1, p2)
# entropy: -p1*log2(p1) - p2*log2(p2) (NaN at the endpoints, where it is 0)
entropy <- -pmax(p1, p2) * log2(pmax(p1, p2)) - pmin(p1, p2) * log2(pmin(p1, p2))
matplot(p1, cbind(gini, class.error, entropy), xlab = "p_m1",
        ylab = "Gini index, Classification error, Entropy",
        type = "l", lty = 1, col = c("green", "blue", "orange"))
legend("topright", legend = c("Gini index", "Class. error", "Entropy"),
       lty = 1, col = c("green", "blue", "orange"))
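As a quick sanity check on the formulas above: all three measures are 0 at a pure node, and at p̂m1 = 0.5 the Gini index and classification error both equal 0.5 while the entropy reaches its maximum of 1 bit.

```r
# Evaluate each impurity measure at p = 0.5 (maximally impure) and p = 1 (pure)
p <- c(0.5, 1)
stopifnot(all.equal(2 * p * (1 - p), c(0.5, 0)))             # Gini index
stopifnot(all.equal(1 - pmax(p, 1 - p), c(0.5, 0)))          # classification error
stopifnot(all.equal(-0.5 * log2(0.5) - 0.5 * log2(0.5), 1))  # entropy at p = 0.5
```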

2 ISL Exercise 8.4.4 (10pts)

FIGURE 8.14. Left: A partition of the predictor space corresponding to Exercise 4a. Right: A tree corresponding to Exercise 4b. This question relates to the plots in Figure 8.14. (a) Sketch the tree corresponding to the partition of the predictor space illustrated in the left-hand panel of Figure 8.14. The numbers inside the boxes indicate the mean of Y within each region. If X1 ≥ 1 then 5; else if X2 ≥ 1 then 15; else if X1 < 0 then 3; else if X2 < 0 then 10; else 0.

library(knitr)
include_graphics("/Users/yangan/Desktop/212A/212a-hw/hw4/8.4.4(a).jpg")

  1. Create a diagram similar to the left-hand panel of Figure 8.14, using the tree illustrated in the right-hand panel of the same figure. You should divide up the predictor space into the correct regions, and indicate the mean for each region.
# (b)
par(xpd = NA)
plot(NA, NA, type = "n", xlim = c(-2, 2), ylim = c(-3, 3), xlab = "X1", ylab = "X2")
# X2 < 1
lines(x = c(-2, 2), y = c(1, 1))
# X1 < 1 with X2 < 1
lines(x = c(1, 1), y = c(-3, 1))
text(x = (-2 + 1)/2, y = -1, labels = c(-1.8))
text(x = 1.5, y = -1, labels = c(0.63))
# X2 < 2 with X2 >= 1
lines(x = c(-2, 2), y = c(2, 2))
text(x = 0, y = 2.5, labels = c(2.49))
# X1 < 0 with X2<2 and X2>=1
lines(x = c(0, 0), y = c(1, 2))
text(x = -1, y = 1.5, labels = c(-1.06))
text(x = 1, y = 1.5, labels = c(0.21))

3 ISL Exercise 8.4.5 (10pts)

  1. Suppose we produce ten bootstrapped samples from a data set containing red and green classes. We then apply a classification tree to each bootstrapped sample and, for a specific value of X, produce 10 estimates of P(Class is Red|X): 0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, and 0.75. There are two common ways to combine these results together into a single class prediction. One is the majority vote approach discussed in this chapter. The second approach is to classify based on the average probability. In this example, what is the final classification under each of these two approaches?
p <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
# Average probability
mean(p)
[1] 0.45

With the majority vote approach, we classify X as Red: 6 of the 10 estimates exceed 0.5 (i.e., predict Red), versus 4 for Green, so Red is the most commonly occurring class among the 10 predictions. With the average probability approach, we classify X as Green, since the average of the 10 probabilities is 0.45 < 0.5.
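Both rules can be verified directly from the same vector p used above:

```r
p <- c(0.1, 0.15, 0.2, 0.2, 0.55, 0.6, 0.6, 0.65, 0.7, 0.75)
votes_red <- sum(p > 0.5)                          # trees predicting Red: 6
ifelse(votes_red > length(p) / 2, "Red", "Green")  # majority vote -> "Red"
ifelse(mean(p) > 0.5, "Red", "Green")              # average probability -> "Green"
```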

4 ISL Lab 8.3. Boston data set (30pts)

Follow the machine learning workflow to train regression tree, random forest, and boosting methods for predicting medv. Evaluate out-of-sample performance on a test set.

rm(list = ls())
library(GGally)
library(gtsummary)
library(ranger)
library(tidyverse)
library(tidymodels)
library(ISLR2)
library(MASS)
library(rpart)
library(rpart.plot)
library(vip)
library(randomForest)
library(gbm)
library(xgboost)
# Load the Boston data set
data(Boston)
head(Boston)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
  medv
1 24.0
2 21.6
3 34.7
4 33.4
5 36.2
6 28.7
Boston %>% tbl_summary()
Characteristic N = 506¹
crim 0.3 (0.1, 3.7)
zn 0 (0, 13)
indus 9.7 (5.2, 18.1)
chas 35 (6.9%)
nox 0.54 (0.45, 0.62)
rm 6.21 (5.89, 6.62)
age 78 (45, 94)
dis 3.21 (2.10, 5.19)
rad
    1 20 (4.0%)
    2 24 (4.7%)
    3 38 (7.5%)
    4 110 (22%)
    5 115 (23%)
    6 26 (5.1%)
    7 17 (3.4%)
    8 24 (4.7%)
    24 132 (26%)
tax 330 (279, 666)
ptratio 19.05 (17.40, 20.20)
black 391 (375, 396)
lstat 11 (7, 17)
medv 21 (17, 25)
¹ Median (IQR); n (%)
Boston <- Boston %>% filter(!is.na(medv))
# Split the data into training and test sets
set.seed(203)
data_split <- initial_split(Boston, prop = 0.5)
Bonston_train <- training(data_split)
Bonston_test <- testing(data_split)
tree_recipe <- 
  recipe(
    medv ~ ., 
    data = Bonston_train
  ) %>%
  # # create traditional dummy variables (not necessary for random forest in R)
  # step_dummy(all_nominal()) %>%
  step_naomit(medv) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>% 
  #  center and scale numeric data
  step_log(medv) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) %>%
  # step_normalize(all_numeric_predictors()) %>%
  # estimate the means and standard deviations
  prep(training = Bonston_train, retain = TRUE)
tree_recipe
tree_recipe_spec <- 
  recipe(
    medv ~ ., 
    data = Bonston_train
  ) %>%
  # # create traditional dummy variables (not necessary for random forest in R)
  # step_dummy(all_nominal()) %>%
  step_naomit(medv) %>%
  # zero-variance filter
  step_zv(all_numeric_predictors()) %>%
 #  center and scale numeric data
  step_log(medv) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric()) 

tree_recipe_spec
# regression tree model
regtree_mod <- decision_tree(
  cost_complexity = tune(),
  tree_depth = tune(),
  min_n = 5,
  mode = "regression",
  engine = "rpart"
)

# random forest model
rf_mod <- rand_forest(
  mode = "regression",
  engine = "randomForest",
  trees = 500,
  mtry = tune()
)

# boosting model
boost_mod <- boost_tree(
  mode = "regression",
  engine = "xgboost",
  trees = 500,
  mtry = tune(),
  learn_rate = tune()
)
# regression tree model
tree_wf <- workflow() %>%
  add_recipe(tree_recipe_spec) %>%
  add_model(regtree_mod)

# random forest model
rf_wf <- workflow() %>%
  add_recipe(tree_recipe_spec) %>%
  add_model(rf_mod)

# boosting model
boost_wf <- workflow() %>%
  add_recipe(tree_recipe_spec) %>%
  add_model(boost_mod)
# grid for regression tree model
tree_grid <- grid_regular(
  cost_complexity(),
  tree_depth(),
  levels = c(100, 5)
)

# grid for random forest model
rf_grid <- grid_regular(
  mtry(range = c(2, ncol(Bonston_train) - 1)),
  levels = 5
)

# grid for boosting model
boost_grid <- grid_regular(
  mtry(range = c(2, ncol(Bonston_train) - 1)),
  # note: learn_rate() is parameterized on the log10 scale
  learn_rate(range = c(0.01, 0.1)),
  levels = 5
)
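Note that dials::learn_rate() is defined on the log10 scale, so range = c(0.01, 0.1) actually produces learning rates between 10^0.01 and 10^0.1, which is why the CV results later show learn_rate values near 1 rather than between 0.01 and 0.1. The endpoints work out to:

```r
# learn_rate() grid endpoints are 10^0.01 and 10^0.1, not 0.01 and 0.1
10^0.01  # ~1.023 (matches the learn_rate = 1.02 in the CV output)
10^0.1   # ~1.259
```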
#### R Set cross-validation partitions.
set.seed(203)

folds <- vfold_cv(Bonston_train, v = 5)
folds
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits           id   
  <list>           <chr>
1 <split [202/51]> Fold1
2 <split [202/51]> Fold2
3 <split [202/51]> Fold3
4 <split [203/50]> Fold4
5 <split [203/50]> Fold5
# cv for regression tree model
tree_fit <- tree_wf %>%
  tune_grid(
    resamples = folds,
    grid = tree_grid,
    metrics = metric_set(rmse, rsq)
  )

# cv for random forest model
rf_fit <- rf_wf %>%
  tune_grid(
    resamples = folds,
    grid = rf_grid,
    metrics = metric_set(rmse, rsq)
  )

# cv for boosting model
boost_fit <- boost_wf %>%
  tune_grid(
    resamples = folds,
    grid = boost_grid,
    metrics = metric_set(rmse, rsq)
  )

Visualize CV results:

tree_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "rmse") %>%
  mutate(tree_depth = as.factor(tree_depth)) %>%
  ggplot(mapping = aes(x = cost_complexity, y = mean, color = tree_depth)) +
  geom_point() + 
  geom_line() + 
  labs(x = "cost_complexity", y = "CV rmse")
# A tibble: 1,000 × 8
   cost_complexity tree_depth .metric .estimator  mean     n std_err
             <dbl>      <int> <chr>   <chr>      <dbl> <int>   <dbl>
 1        1   e-10          1 rmse    standard   0.736     5  0.0548
 2        1   e-10          1 rsq     standard   0.459     5  0.0304
 3        1.23e-10          1 rmse    standard   0.736     5  0.0548
 4        1.23e-10          1 rsq     standard   0.459     5  0.0304
 5        1.52e-10          1 rmse    standard   0.736     5  0.0548
 6        1.52e-10          1 rsq     standard   0.459     5  0.0304
 7        1.87e-10          1 rmse    standard   0.736     5  0.0548
 8        1.87e-10          1 rsq     standard   0.459     5  0.0304
 9        2.31e-10          1 rmse    standard   0.736     5  0.0548
10        2.31e-10          1 rsq     standard   0.459     5  0.0304
   .config               
   <chr>                 
 1 Preprocessor1_Model001
 2 Preprocessor1_Model001
 3 Preprocessor1_Model002
 4 Preprocessor1_Model002
 5 Preprocessor1_Model003
 6 Preprocessor1_Model003
 7 Preprocessor1_Model004
 8 Preprocessor1_Model004
 9 Preprocessor1_Model005
10 Preprocessor1_Model005
# ℹ 990 more rows

rf_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "rmse") %>%
  ggplot(mapping = aes(x = mtry, y = mean)) +
  geom_point() + 
  geom_line() + 
  labs(x = "mtry", y = "CV rmse")
# A tibble: 10 × 7
    mtry .metric .estimator  mean     n std_err .config             
   <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
 1     2 rmse    standard   0.457     5  0.0399 Preprocessor1_Model1
 2     2 rsq     standard   0.786     5  0.0697 Preprocessor1_Model1
 3     4 rmse    standard   0.416     5  0.0440 Preprocessor1_Model2
 4     4 rsq     standard   0.808     5  0.0665 Preprocessor1_Model2
 5     7 rmse    standard   0.402     5  0.0444 Preprocessor1_Model3
 6     7 rsq     standard   0.816     5  0.0595 Preprocessor1_Model3
 7    10 rmse    standard   0.402     5  0.0453 Preprocessor1_Model4
 8    10 rsq     standard   0.814     5  0.0575 Preprocessor1_Model4
 9    13 rmse    standard   0.407     5  0.0442 Preprocessor1_Model5
10    13 rsq     standard   0.811     5  0.0547 Preprocessor1_Model5

boost_fit %>%
  collect_metrics() %>%
  print(width = Inf) %>%
  filter(.metric == "rmse") %>%
  ggplot(mapping = aes(x = mtry, y = mean, group = learn_rate, color = learn_rate)) +
  geom_point() + 
  geom_line() + 
  labs(x = "mtry", y = "CV rmse")
# A tibble: 50 × 8
    mtry learn_rate .metric .estimator  mean     n std_err .config              
   <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
 1     2       1.02 rmse    standard   0.557     5  0.0458 Preprocessor1_Model01
 2     2       1.02 rsq     standard   0.674     5  0.0875 Preprocessor1_Model01
 3     4       1.02 rmse    standard   0.515     5  0.0295 Preprocessor1_Model02
 4     4       1.02 rsq     standard   0.738     5  0.0428 Preprocessor1_Model02
 5     7       1.02 rmse    standard   0.512     5  0.0411 Preprocessor1_Model03
 6     7       1.02 rsq     standard   0.723     5  0.0541 Preprocessor1_Model03
 7    10       1.02 rmse    standard   0.484     5  0.0234 Preprocessor1_Model04
 8    10       1.02 rsq     standard   0.751     5  0.0449 Preprocessor1_Model04
 9    13       1.02 rmse    standard   0.507     5  0.0384 Preprocessor1_Model05
10    13       1.02 rsq     standard   0.739     5  0.0515 Preprocessor1_Model05
# ℹ 40 more rows

tree_fit %>%
  show_best(metric = "rmse")
# A tibble: 5 × 8
  cost_complexity tree_depth .metric .estimator  mean     n std_err .config     
            <dbl>      <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>       
1         0.00351          8 rmse    standard   0.488     5  0.0537 Preprocesso…
2         0.00351         11 rmse    standard   0.488     5  0.0537 Preprocesso…
3         0.00351         15 rmse    standard   0.488     5  0.0537 Preprocesso…
4         0.00285          8 rmse    standard   0.490     5  0.0545 Preprocesso…
5         0.00285         11 rmse    standard   0.490     5  0.0545 Preprocesso…
rf_fit %>%
  show_best(metric = "rmse")
# A tibble: 5 × 7
   mtry .metric .estimator  mean     n std_err .config             
  <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1    10 rmse    standard   0.402     5  0.0453 Preprocessor1_Model4
2     7 rmse    standard   0.402     5  0.0444 Preprocessor1_Model3
3    13 rmse    standard   0.407     5  0.0442 Preprocessor1_Model5
4     4 rmse    standard   0.416     5  0.0440 Preprocessor1_Model2
5     2 rmse    standard   0.457     5  0.0399 Preprocessor1_Model1
boost_fit %>%
  show_best(metric = "rmse")
# A tibble: 5 × 8
   mtry learn_rate .metric .estimator  mean     n std_err .config              
  <int>      <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1    10       1.02 rmse    standard   0.484     5  0.0234 Preprocessor1_Model04
2    13       1.14 rmse    standard   0.491     5  0.0447 Preprocessor1_Model15
3    10       1.08 rmse    standard   0.491     5  0.0443 Preprocessor1_Model09
4     7       1.08 rmse    standard   0.504     5  0.0136 Preprocessor1_Model08
5    13       1.02 rmse    standard   0.507     5  0.0384 Preprocessor1_Model05

Let’s select the best model.

best_tree <- tree_fit %>%
  select_best(metric = "rmse")
best_tree
# A tibble: 1 × 3
  cost_complexity tree_depth .config               
            <dbl>      <int> <chr>                 
1         0.00351          8 Preprocessor1_Model284
best_rf <- rf_fit %>%
  select_best(metric = "rmse")
best_rf
# A tibble: 1 × 2
   mtry .config             
  <int> <chr>               
1    10 Preprocessor1_Model4
best_boost <- boost_fit %>%
  select_best(metric = "rmse")
best_boost
# A tibble: 1 × 3
   mtry learn_rate .config              
  <int>      <dbl> <chr>                
1    10       1.02 Preprocessor1_Model04
# Final workflow
final_wftree <- tree_wf %>%
  finalize_workflow(best_tree)
final_wftree
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_naomit()
• step_zv()
• step_log()
• step_center()
• step_scale()

── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (regression)

Main Arguments:
  cost_complexity = 0.00351119173421513
  tree_depth = 8
  min_n = 5

Computational engine: rpart 
final_wfrf <- rf_wf %>%
  finalize_workflow(best_rf)
final_wfrf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_naomit()
• step_zv()
• step_log()
• step_center()
• step_scale()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (regression)

Main Arguments:
  mtry = 10
  trees = 500

Computational engine: randomForest 
final_wfboost <- boost_wf %>%
  finalize_workflow(best_boost)
final_wfboost
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_naomit()
• step_zv()
• step_log()
• step_center()
• step_scale()

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (regression)

Main Arguments:
  mtry = 10
  trees = 500
  learn_rate = 1.02329299228075

Computational engine: xgboost 
# Fit the whole training set, then predict the test cases
final_fittree <- 
  final_wftree %>%
  last_fit(data_split)
final_fittree
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits            id               .metrics .notes   .predictions .workflow 
  <list>            <chr>            <list>   <list>   <list>       <list>    
1 <split [253/253]> train/test split <tibble> <tibble> <tibble>     <workflow>
final_fitrf <- 
  final_wfrf %>%
  last_fit(data_split)
final_fitrf
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits            id               .metrics .notes   .predictions .workflow 
  <list>            <chr>            <list>   <list>   <list>       <list>    
1 <split [253/253]> train/test split <tibble> <tibble> <tibble>     <workflow>
final_fitboost <- 
  final_wfboost %>%
  last_fit(data_split)
final_fitboost
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits            id               .metrics .notes   .predictions .workflow 
  <list>            <chr>            <list>   <list>   <list>       <list>    
1 <split [253/253]> train/test split <tibble> <tibble> <tibble>     <workflow>
# Test metrics
final_fittree %>% 
  collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       0.575 Preprocessor1_Model1
2 rsq     standard       0.679 Preprocessor1_Model1
final_fitrf %>%
  collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       0.367 Preprocessor1_Model1
2 rsq     standard       0.843 Preprocessor1_Model1
final_fitboost %>%
  collect_metrics()
# A tibble: 2 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 rmse    standard       0.545 Preprocessor1_Model1
2 rsq     standard       0.729 Preprocessor1_Model1
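Collecting the test metrics above in one table makes the comparison explicit: the random forest achieves the lowest test RMSE (0.367) and the highest R² (0.843), followed by boosting and the pruned regression tree, so the random forest has the best out-of-sample performance here.

```r
# Test-set metrics copied from the collect_metrics() output above
tibble(
  method    = c("regression tree", "random forest", "boosting"),
  test_rmse = c(0.575, 0.367, 0.545),
  test_rsq  = c(0.679, 0.843, 0.729)
) %>%
  arrange(test_rmse)
```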
# Visualize the final model
library(rpart.plot)
library(randomForest)
library(xgboost)
library(data.table)
final_tree <- extract_workflow(final_fittree)
final_tree
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_naomit()
• step_zv()
• step_log()
• step_center()
• step_scale()

── Model ───────────────────────────────────────────────────────────────────────
n= 253 

node), split, n, deviance, yval
      * denotes terminal node

  1) root 253 252.00000000 -1.721943e-15  
    2) lstat>=0.2389889 90  52.87544000 -9.274618e-01  
      4) crim>=0.4172743 39  19.49364000 -1.457369e+00  
        8) lstat>=1.596959 11   5.76440300 -2.135433e+00  
         16) dis< -1.083781 2   0.14677250 -3.043921e+00 *
         17) dis>=-1.083781 9   3.60010500 -1.933546e+00  
           34) dis>=-1.029693 6   0.37089320 -2.172958e+00 *
           35) dis< -1.029693 3   2.19748600 -1.454722e+00 *
        9) lstat< 1.596959 28   6.68489700 -1.190986e+00  
         18) crim>=0.9254287 11   2.12120200 -1.503941e+00 *
         19) crim< 0.9254287 17   2.78924500 -9.884866e-01  
           38) rm< -0.6474225 4   0.37970260 -1.539559e+00 *
           39) rm>=-0.6474225 13   0.82105820 -8.189259e-01 *
      5) crim< 0.4172743 51  14.05606000 -5.222389e-01  
       10) crim>=-0.4060389 45   9.65170800 -6.259578e-01  
         20) black< 0.2746026 23   5.39535000 -8.422694e-01  
           40) age>=1.071618 2   1.76464300 -1.586705e+00 *
           41) age< 1.071618 21   2.41678100 -7.713708e-01 *
         21) black>=0.2746026 22   2.05507000 -3.998140e-01  
           42) dis< -1.00852 3   0.02951063 -9.425904e-01 *
           43) dis>=-1.00852 19   1.00219000 -3.141124e-01 *
       11) crim< -0.4060389 6   0.28956410  2.556533e-01 *
    3) lstat< 0.2389889 163  78.96246000  5.120955e-01  
      6) rm< 0.5276014 113  21.39101000  1.634739e-01  
       12) dis>=-1.186972 111  13.88784000  1.288850e-01  
         24) rm< -0.2633651 52   5.24499700 -5.418782e-02  
           48) ptratio>=1.016231 7   1.29369100 -4.147737e-01 *
           49) ptratio< 1.016231 45   2.89957100  1.903327e-03 *
         25) rm>=-0.2633651 59   5.36399100  2.902374e-01  
           50) tax>=-1.084652 55   3.39454500  2.437258e-01  
            100) lstat>=-0.4004746 17   0.75930570 -1.817869e-04 *
            101) lstat< -0.4004746 38   1.17144900  3.528423e-01 *
           51) tax< -1.084652 4   0.21444600  9.297718e-01 *
       13) dis< -1.186972 2   0.00000000  2.083157e+00 *
      7) rm>=0.5276014 50  12.79965000  1.299980e+00  
       14) rm< 1.659133 34   3.66122100  1.018091e+00  
         28) dis>=-0.8659154 32   2.35915800  9.745802e-01  
           56) rm< 0.7244216 8   0.58169800  6.630926e-01 *
           57) rm>=0.7244216 24   0.74253100  1.078409e+00 *
         29) dis< -0.8659154 2   0.27215970  1.714267e+00 *
       15) rm>=1.659133 16   0.69565820  1.898994e+00 *
final_tree %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

final_rf <- extract_workflow(final_fitrf)
final_rf
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_naomit()
• step_zv()
• step_log()
• step_center()
• step_scale()

── Model ───────────────────────────────────────────────────────────────────────

Call:
 randomForest(x = maybe_data_frame(x), y = y, ntree = ~500, mtry = min_cols(~10L,      x)) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 10

          Mean of squared residuals: 0.136993
                    % Var explained: 86.25
final_rf %>%
  extract_fit_engine() %>%
  randomForest::varImpPlot()

final_boost <- extract_workflow(final_fitboost)
final_boost
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_naomit()
• step_zv()
• step_log()
• step_center()
• step_scale()

── Model ───────────────────────────────────────────────────────────────────────
##### xgb.Booster
raw: 388.7 Kb 
call:
  xgboost::xgb.train(params = list(eta = 1.02329299228075, max_depth = 6, 
    gamma = 0, colsample_bytree = 1, colsample_bynode = 0.769230769230769, 
    min_child_weight = 1, subsample = 1), data = x$data, nrounds = 500, 
    watchlist = x$watchlist, verbose = 0, nthread = 1, objective = "reg:squarederror")
params (as set within xgb.train):
  eta = "1.02329299228075", max_depth = "6", gamma = "0", colsample_bytree = "1", colsample_bynode = "0.769230769230769", min_child_weight = "1", subsample = "1", nthread = "1", objective = "reg:squarederror", validate_parameters = "TRUE"
xgb.attributes:
  niter
callbacks:
  cb.evaluation.log()
# of features: 13 
niter: 500
nfeatures : 13 
evaluation_log:
    iter training_rmse
       1  0.3077342515
       2  0.2053184932
---                   
     499  0.0005818133
     500  0.0005818133
boost_model <- extract_fit_engine(final_boost)
importance <- xgb.importance(model = boost_model)
xgb.plot.importance(importance_matrix = importance)

library(vip)

final_tree %>% 
  extract_fit_parsnip() %>% 
  vip()

final_rf %>%
  extract_fit_parsnip() %>%
  vip()

final_boost %>%
  extract_fit_parsnip() %>%
  vip()

Follow the machine learning workflow to train classification tree, random forest, and boosting methods for classifying Sales <= 8 versus Sales > 8. Evaluate out-of-sample performance on a test set.
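For reference, a minimal sketch of the setup for this classification task (assuming the Carseats data from ISLR2; the binary response High = Sales > 8 is created before splitting, and the rest of the workflow mirrors the regression case above):

```r
library(tidymodels)
library(ISLR2)

# Create the binary response and drop Sales so the label is not leaked
Carseats_cls <- Carseats %>%
  mutate(High = factor(ifelse(Sales > 8, "Yes", "No"))) %>%
  dplyr::select(-Sales)   # dplyr:: in case MASS masks select()

set.seed(203)
cls_split <- initial_split(Carseats_cls, prop = 0.5)

cls_recipe <- recipe(High ~ ., data = training(cls_split)) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_numeric_predictors())

# Classification tree; random forest and boosting specs are analogous,
# with mode = "classification" and the appropriate engines
cls_tree_mod <- decision_tree(
  cost_complexity = tune(),
  tree_depth = tune(),
  mode = "classification",
  engine = "rpart"
)

cls_wf <- workflow() %>%
  add_recipe(cls_recipe) %>%
  add_model(cls_tree_mod)
```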